Abstract The traditional lower bound estimation method for powerlaw distributions, 
based on the Kolmogorov-Smirnov distance, has been shown to perform better than 
other competing methods. However, when applied to very large collections of data, 
such a method can be computationally demanding. In this paper, we propose two 
alternative methods that aim to reduce the time required by the estimation 
procedure. We apply the traditional method and the two proposed methods to 
large collections of data (N = 500,000) with varying values of the true lower 
bound. Both proposed methods yield significantly better performance and 
accuracy than the traditional method. 
1 Introduction 

Over the last few decades, powerlaw distributions have attracted particular 
attention for their mathematical properties and their appearance in a wide variety 
of scientific contexts, from the physical and biological sciences to social and 
man-made phenomena [1][2][3][4][5]. Unlike normally distributed data, empirical 
quantities that follow a powerlaw distribution do not cluster around an average 
value, and thus cannot be characterized through the mean and standard deviation. 
Nevertheless, the fact that some scientific observations cannot be characterized as 
simply as other measurements is often a sign of complex underlying processes that 
deserve further study [6]. A complete introduction to powerlaw distributions, along 
with a statistical framework for discerning and quantifying powerlaw behavior in 
empirical data, can be found in [7], whereas extensive discussions can be found in 
[8][9][10] and references therein. Recent advances related to powerlaw fitting and 
statistical hypothesis testing can be found in [11][12]. 
Formally, a quantity $x$ follows a powerlaw distribution if its probability 
distribution is defined as 

    $$p(x) \propto x^{-\alpha},$$

where $x > 0$ and $\alpha > 1$ is called the scaling parameter of the distribution. 
Fitting this kind of heavy-tailed distribution requires care, since only a few 
empirical phenomena show such a probability distribution for all values of $x$. 
Indeed, more often only values greater than some minimum value $x_{\min}$, i.e. the 
so-called lower bound, follow a powerlaw distribution. 

The traditional lower bound estimation method introduced in [7] is based on 
the computation of the Kolmogorov-Smirnov distances between the empirical and 
the theoretical cumulative distribution functions defined for values 
$x \geq x_{\min}$ when $x$ is discrete ($x > x_{\min}$ when $x$ is continuous). Once 
the Kolmogorov-Smirnov distances have been computed for all the eligible values of 
$x_{\min}$, the $x_{\min}$ associated with the smallest distance is chosen as the 
lower bound of the distribution. However, if applied to very large collections of 
data - e.g. the distribution of the number of views received by YouTube videos - 
such a method can be computationally demanding, and bootstrap techniques that 
address the uncertainty in the estimates by averaging over multiple estimations 
become infeasible. 

In this paper, we propose two alternative methods that aim to reduce 
the time required by the traditional estimation procedure. In particular, the first 
proposed method starts computing the traditional Kolmogorov-Smirnov distance 
from a guess on the true value of the lower bound, and stops the procedure once 
a minimum is reached. The second proposed method is designed for the discrete 
case, where the computation of theoretical cumulative distribution functions 
involves the calculation of Hurwitz zeta functions, which can be computationally 
expensive. This method uses the same starting guess and stopping rule to reduce the 
number of computations, and substitutes the cumulative distribution functions of 
the traditional Kolmogorov-Smirnov distance with the corresponding probability mass 
functions, i.e. it is based on the comparison between empirical and theoretical 
probabilities for each $x \geq x_{\min}$. 

This manuscript is organized as follows. In Section 2 we provide some basic 
definitions of continuous and discrete powerlaw distributions. In Section 3 
we first discuss the traditional estimation method, and then we introduce two 
alternative methods which can speed up the estimation procedure. In Section 4 we 
apply the three methods to large collections of data (N = 500,000) with varying 
values of the true lower bound, showing that both our proposed methods yield 
significantly better performance and accuracy than the traditional method. Section 
5 offers some concluding remarks. 

2 Definitions 

Let $x$ represent a quantity whose distribution we are interested in. The probability 
distribution when $x$ is continuous is defined as 

    $$p(x)\,dx = \Pr(x \leq X < x + dx),$$

whereas in the discrete case, when $x$ can assume only positive integers, the 
probability distribution is defined as 

    $$p(x) = \Pr(X = x) = \frac{x^{-\alpha}}{\zeta(\alpha, x_{\min})},$$

where $x_{\min} > 0$ is the lower bound, $\alpha > 1$ is the scaling parameter, and 

    $$\zeta(\alpha, x_{\min}) = \sum_{n=0}^{\infty} (n + x_{\min})^{-\alpha}$$

is the Hurwitz zeta function. Furthermore, the complementary cumulative 
distribution function in the continuous case is defined as 



    $$P(x) = 1 - \Pr(X < x) = \Pr(X \geq x),$$

whereas in the discrete case it is defined as 

    $$P(x) = \Pr(X \geq x) = \frac{\zeta(\alpha, x)}{\zeta(\alpha, x_{\min})}.$$

The complementary cumulative distribution function is often preferred to the 
cumulative distribution function, since it allows powerlaw distributions to be 
displayed on doubly logarithmic axes, thus emphasizing their upper-tail behavior. 
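
As a minimal illustration of the discrete quantities in this section, the following 
Python sketch evaluates the Hurwitz zeta function, the probability mass function 
$x^{-\alpha}/\zeta(\alpha, x_{\min})$ and the complementary cumulative distribution 
function $\zeta(\alpha, x)/\zeta(\alpha, x_{\min})$. It assumes these standard forms 
of the discrete powerlaw. The function names (`hurwitz_zeta`, `pmf`, `ccdf`) and the 
Euler-Maclaurin tail correction are illustrative choices made here, not part of the 
staTools or poweRlaw packages. 

```python
import math


def hurwitz_zeta(s, q, n_terms=20):
    """Approximate zeta(s, q) = sum_{n>=0} (n + q)^(-s) for s > 1, q > 0.

    The first n_terms are summed directly; the remainder is estimated
    with the leading Euler-Maclaurin correction terms.
    """
    head = sum((n + q) ** -s for n in range(n_terms))
    a = n_terms + q
    tail = (
        a ** (1 - s) / (s - 1)          # integral of the remaining terms
        + 0.5 * a ** -s                 # half of the first omitted term
        + s * a ** (-s - 1) / 12        # B_2 correction
        - s * (s + 1) * (s + 2) * a ** (-s - 3) / 720  # B_4 correction
    )
    return head + tail


def pmf(x, alpha, x_min):
    """Probability that X equals x, for integer x >= x_min."""
    return x ** -alpha / hurwitz_zeta(alpha, x_min)


def ccdf(x, alpha, x_min):
    """Probability that X >= x, for integer x >= x_min."""
    return hurwitz_zeta(alpha, x) / hurwitz_zeta(alpha, x_min)
```

Note that each call to `ccdf` needs a zeta evaluation at $x$, whereas `pmf` only needs 
the normalizing constant, which can be computed once per candidate $x_{\min}$. 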

3 Lower bound estimation 

3.1 Traditional Method 

The traditional method to estimate the lower bound of a powerlaw distribution 
was introduced in [7]. Such a method is based on the Kolmogorov-Smirnov 
distance, which is defined as 

    $$D = \max(|E - T|),$$

where $E$ is the empirical cumulative distribution function, and $T$ is the 
theoretical cumulative distribution function of the fitted powerlaw distribution for 
values $x \geq x_{\min}$ when $x$ is discrete ($x > x_{\min}$ when $x$ is continuous; 
hereafter we refer only to the discrete case for the sake of brevity). Once the 
Kolmogorov-Smirnov distance has been computed for all the possible values of 
$x_{\min}$, the $x_{\min}$ associated with the smallest value of $D$ is chosen as 
the lower bound of the powerlaw distribution. 
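
The full scan described above can be sketched in Python as follows. This is only an 
illustration under stated assumptions, not the poweRlaw implementation: the scaling 
parameter is fitted with a commonly used closed-form approximation to the discrete 
maximum-likelihood estimator (the text does not specify the fitting step), the 
distance is evaluated on the integer grid between the candidate and the largest 
observation, and all function names are hypothetical. 

```python
import math


def hurwitz_zeta(s, q, n_terms=20):
    """zeta(s, q) via a direct partial sum plus Euler-Maclaurin tail terms."""
    head = sum((n + q) ** -s for n in range(n_terms))
    a = n_terms + q
    return (head + a ** (1 - s) / (s - 1) + 0.5 * a ** -s
            + s * a ** (-s - 1) / 12
            - s * (s + 1) * (s + 2) * a ** (-s - 3) / 720)


def fit_alpha(tail, x_min):
    """Approximate discrete MLE of the scaling parameter (assumed here)."""
    return 1.0 + len(tail) / sum(math.log(x / (x_min - 0.5)) for x in tail)


def ks_distance(data, x_min):
    """max |E - T| over integers from x_min to the largest observation."""
    tail = [x for x in data if x >= x_min]
    n = len(tail)
    alpha = fit_alpha(tail, x_min)
    norm = hurwitz_zeta(alpha, x_min)
    if norm == 0.0:
        return math.inf  # degenerate fit: normalizing constant underflowed
    counts = {}
    for x in tail:
        counts[x] = counts.get(x, 0) + 1
    distance, empirical = 0.0, 0.0
    for x in range(x_min, max(tail) + 1):
        empirical += counts.get(x, 0) / n
        # One Hurwitz zeta evaluation per grid point: this is the costly step.
        theoretical = 1.0 - hurwitz_zeta(alpha, x + 1) / norm
        distance = max(distance, abs(empirical - theoretical))
    return distance


def full_scan_xmin(data):
    """Evaluate every candidate and return the one with the smallest distance."""
    candidates = sorted(set(data))[:-1]  # keep at least two distinct tail values
    distances = {x: ks_distance(data, x) for x in candidates}
    return min(distances, key=distances.get)
```

The nested loop makes the cost visible: every candidate triggers a new fit and a new 
pass over its tail, each step of which evaluates a Hurwitz zeta function. 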

In [7] it was shown that the estimation method based on the Kolmogorov-Smirnov 
distance outperforms alternative methods based on the BIC (Bayesian 
Information Criterion) and the Anderson-Darling statistic. Nevertheless, when 
dealing with big data such a method can be computationally demanding for two 
main reasons: 

1. the algorithm needs to compute the Kolmogorov-Smirnov distance for each 
possible $x_{\min}$; 

2. in the discrete case, the computation of the theoretical cumulative distribution 
function of the fitted powerlaw distributions involves the calculation of Hurwitz 
zeta functions, which can be computationally expensive when dealing with large 
data collections. 
3.2 Proposed Methods 

In this paper, we introduce two lower bound estimation methods in order 
to tackle the above-mentioned drawbacks of the traditional method. We start 
from two simple observations. First of all, there is often no need to compute the 
Kolmogorov-Smirnov distance for all the eligible values of $x_{\min}$, since a 
graphical exploratory analysis is usually sufficient to rule out a substantial range 
of values. On the left panel of Figure 1 we show the complementary cumulative 
distribution function of a randomly generated distribution with a powerlaw tail. 
Notice that a quick look at the plot is sufficient to rule out some eligible values 
for the lower bound. When dealing with big data and large maximal values, we could 
rule out hundreds of possible values, thus reducing the time required to estimate 
the lower bound. Moreover, we know by definition that the Kolmogorov-Smirnov 
statistic computed for all the possible values of $x_{\min}$ has a global minimum at 
the true lower bound. On the right panel of Figure 1 we show the values of the 
Kolmogorov-Smirnov distance for successive values of $x_{\min}$. 




Fig. 1 Exploratory Analysis and Kolmogorov-Smirnov distances. On the left panel we 
show the complementary cumulative distribution function of a randomly generated 
distribution with a powerlaw tail ($N = 10^4$, $\alpha = 2$, $x_{\min} = 100$). Notice 
that a quick graphical analysis is sufficient to rule out some eligible values of 
$x_{\min}$. When dealing with big data and large maximal values, we could rule out 
hundreds of eligible values, thus reducing the time required to estimate the lower 
bound. On the right panel we show the values of the Kolmogorov-Smirnov statistic for 
successive values of $x_{\min}$. 


Taking into account these observations, we propose a first estimation method 
which starts computing the Kolmogorov-Smirnov distances from the eligible value 
of $x_{\min}$ that is closest to 

    $$g - g\,\frac{100 - c}{100} \in [\min(x), \max(x)],$$

where $g$ is a guess on the true value of the lower bound, and $c \in [1, 100]$ is 
the confidence in such a guess. The computation of the Kolmogorov-Smirnov distances 
stops when all the differences between the last $k$ distances are positive, i.e. 
when a minimum is reached. There are two key ideas: 

1. since a quick graphical exploratory analysis is often sufficient to rule out a 
large number of eligible values, we can start computing the Kolmogorov-Smirnov 
distances from a value close to the one we believe to be the true lower bound; 

2. if our guess is close to the true lower bound, the first minimum of the 
Kolmogorov-Smirnov statistic we meet is the global minimum associated with the true 
lower bound, and hence we can stop the computation once a minimum is 
reached. 
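
The starting point and the stopping rule can be sketched in Python as below. This is 
an illustration rather than the staTools code: the function names are hypothetical, 
the distance is passed in as an arbitrary callable, and the stopping rule is read as 
"the last $k$ computed distances are strictly increasing", which is one plausible 
interpretation of the condition stated above (it requires $k \geq 2$). 

```python
def start_value(g, c, candidates):
    """Eligible value closest to g - g * (100 - c) / 100."""
    target = g - g * (100 - c) / 100
    return min(candidates, key=lambda x: abs(x - target))


def minimum_reached(distances, k):
    """True once the last k distances are strictly increasing."""
    last = distances[-k:]
    return len(last) == k and all(b > a for a, b in zip(last, last[1:]))


def guided_search(candidates, distance, g, c=90, k=5):
    """Scan upward from the starting value and stop after the first minimum.

    `distance` maps a candidate lower bound to its distance statistic.
    Returns the visited candidate with the smallest distance.
    """
    ordered = sorted(candidates)
    start = ordered.index(start_value(g, c, ordered))
    visited, distances = [], []
    for x in ordered[start:]:
        visited.append(x)
        distances.append(distance(x))
        if minimum_reached(distances, k):
            break
    best = min(range(len(distances)), key=distances.__getitem__)
    return visited[best]
```

With a guess of 500 and a confidence of 90, `start_value` selects the candidate 
closest to 450, so every candidate below that value is never evaluated. 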

Moreover, since in the discrete case the computation of the theoretical 
cumulative distributions involves the calculation of Hurwitz zeta functions, we 
propose a second estimation method that further modifies the traditional method by 
substituting the empirical and theoretical cumulative distribution functions of the 
Kolmogorov-Smirnov distance with the corresponding probability mass functions, 
which are generally faster to compute. More formally, in the second proposed 
method the distance to be computed is defined as 

    $$D_{pmf} = \max(|E_{pmf} - T_{pmf}|),$$

where the subscripts indicate that $E$ and $T$ are, respectively, the empirical 
probability mass function and the theoretical probability mass function of the 
fitted powerlaw distribution for values $x \geq x_{\min}$. 
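
A minimal Python sketch of this probability-mass-based distance follows. It assumes 
the discrete powerlaw mass function $x^{-\alpha}/\zeta(\alpha, x_{\min})$ and takes 
the fitted scaling parameter as an input; the function names are hypothetical and the 
code is not taken from staTools. 

```python
def hurwitz_zeta(s, q, n_terms=20):
    """zeta(s, q) via a direct partial sum plus Euler-Maclaurin tail terms."""
    head = sum((n + q) ** -s for n in range(n_terms))
    a = n_terms + q
    return (head + a ** (1 - s) / (s - 1) + 0.5 * a ** -s
            + s * a ** (-s - 1) / 12
            - s * (s + 1) * (s + 2) * a ** (-s - 3) / 720)


def pmf_distance(data, x_min, alpha):
    """max |E_pmf - T_pmf| over integers from x_min to the largest observation."""
    tail = [x for x in data if x >= x_min]
    n = len(tail)
    counts = {}
    for x in tail:
        counts[x] = counts.get(x, 0) + 1
    # A single zeta evaluation per candidate x_min: the normalizing constant.
    norm = hurwitz_zeta(alpha, x_min)
    return max(
        abs(counts.get(x, 0) / n - x ** -alpha / norm)
        for x in range(x_min, max(tail) + 1)
    )
```

Unlike a distance built on cumulative distribution functions evaluated through 
$\zeta(\alpha, x)$ at every $x$, this version needs only one zeta evaluation per 
candidate lower bound. 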


4 Results and Discussion 

In this section, we illustrate and discuss the results of a simulation¹ comparing 
the two proposed methods with the traditional estimation method. We refer to the 
traditional method as estimate_xmin - the name of the lower bound estimation 
function provided by the R package poweRlaw [13] - and to our proposed methods 
as 

1. getXmin: the first proposed method, still based on the traditional 
Kolmogorov-Smirnov distance; 

2. getXmin2: the second proposed method, based on distances between empirical 
and theoretical probability mass functions. 

Both getXmin and getXmin2 are implemented for discrete powerlaw distributions 
in the R package staTools [14], which is currently available on CRAN. 


¹ All the simulations have been performed on a machine with Ubuntu 14.10, R 3.1.1, Intel 
quad-core 4 GHz CPU, 32 GB RAM. 
In order to test the three different methods, we generate synthetic data and 
examine both the accuracy and the performance in the estimation of the true lower 
bound. We use data drawn from a distribution of the form 

    $$p(x) = \begin{cases}
        C\,(x/x_{\min})^{-\alpha} & x \geq x_{\min} \\
        C\,e^{-\alpha(x/x_{\min} - 1)} & x < x_{\min}
    \end{cases} \qquad (1)$$

where $\alpha = 3$ and $C$ is a normalization constant. We apply the three estimation 
methods to large (N = 500,000) collections of data drawn from Eq. (1) with true 
values of $x_{\min}$ varying in $\{50, 100, \ldots, 500\}$. 
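
One way to draw integer data with the shape of Eq. (1) is sketched below in Python. 
The text does not say how the synthetic data were generated, so everything here is an 
assumption made for illustration: the support is taken to be the integers from 1 to a 
hypothetical truncation point `x_max`, the normalization constant is handled 
implicitly by weighted sampling, and the function names are invented. 

```python
import math
import random


def eq1_weight(x, x_min, alpha):
    """Unnormalized density of Eq. (1): exponential body, powerlaw tail."""
    if x >= x_min:
        return (x / x_min) ** -alpha
    return math.exp(-alpha * (x / x_min - 1))


def sample_eq1(n, x_min, alpha=3.0, x_max=100_000, seed=0):
    """Draw n integers in [1, x_max] with probabilities proportional to Eq. (1).

    x_max truncates the tail (an assumption of this sketch), and the fixed
    seed only makes the draw reproducible.
    """
    rng = random.Random(seed)
    support = range(1, x_max + 1)
    weights = [eq1_weight(x, x_min, alpha) for x in support]
    return rng.choices(support, weights=weights, k=n)
```

The two branches meet at $x = x_{\min}$, where both weights equal one, so the density 
is continuous at the lower bound and the change of regime is not marked by a jump. 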

In our simulation, we set $g$ equal to the true lower bound with a 90% confidence, 
$c = 90$ - e.g. when the true lower bound is 500, our proposed methods start 
computing the corresponding statistics from the possible value of $x_{\min}$ that is 
closest to 500 - 500 × (100 - 90)/100 = 450, which is feasible in practice² and thus 
a reasonable assumption. Moreover, both getXmin and getXmin2 stop computing 
the corresponding statistics once a first minimum is reached, i.e. when all the 
differences between the last $k = 5$ computed distances are positive. 



Fig. 2 Simulations (horizontal axis: true $x_{\min}$). Both getXmin and getXmin2 
($g$ = true $x_{\min}$, $c = 90$, $k = 5$) outperform the traditional estimation method. 


Figure 2 shows the estimated value of $x_{\min}$ as a function of the true lower 
bound, indicating that both getXmin and getXmin2 outperform the traditional 
estimation method. Table 1 summarizes the accuracy of the three methods through 
mean squared errors (MSEs), root mean squared errors (RMSEs), and mean 
absolute errors (MAEs), confirming that both the proposed methods yield a better 
accuracy than the traditional method. 


² The inspect function provided by the R package staTools allows the user to quickly 
visualize the powerlaw fit for different values of $x_{\min}$, thus assisting the user 
in making a good guess on the true value of the lower bound. 
                   MSE       RMSE     MAE 

getXmin            768.3     27.72    24.1 
getXmin2           925.5     30.42    26.5 
estimate_xmin      2841.3    53.30    40.5 

Table 1 Estimation accuracy. Mean squared errors, root mean squared errors, and mean 
absolute errors summarizing the accuracy of the lower bound estimates obtained by means of 
three different methods. Both the proposed methods yield a better accuracy than the 
traditional method. 
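
For reference, the three accuracy measures reported in Table 1 can be computed as in 
the following Python sketch. The function name and the example values in the usage 
note are hypothetical and are not the simulation data. 

```python
import math


def accuracy_metrics(true_values, estimates):
    """Return (MSE, RMSE, MAE) of the estimates against the true values."""
    errors = [e - t for t, e in zip(true_values, estimates)]
    mse = sum(err ** 2 for err in errors) / len(errors)
    rmse = math.sqrt(mse)
    mae = sum(abs(err) for err in errors) / len(errors)
    return mse, rmse, mae
```

For example, true lower bounds of 50, 100 and 150 with made-up estimates of 53, 96 
and 150 give errors of 3, -4 and 0, hence an MSE of 25/3 and an MAE of 7/3. Since the 
RMSE is the square root of the MSE, the two columns of Table 1 can be cross-checked: 
the square root of 768.3 is approximately 27.72. 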


Figure 3 illustrates the time required by the different estimation methods, 
indicating that our proposed methods yield a better performance than the 
traditional estimation method. 



Fig. 3 Performance (horizontal axis: true $x_{\min}$). Time required by the different 
methods to estimate the lower bound. Both our proposed methods yield a better 
performance than the traditional estimation method. 


5 Conclusions 

The traditional lower bound estimation method for powerlaw distributions has been 
shown to outperform competing methods based on the BIC and the Anderson-Darling 
statistic. However, if applied to very large collections of data, such a method can 
be computationally demanding, and bootstrap techniques that address the uncertainty 
in the estimates by averaging over multiple estimations become infeasible. In this 
paper, we propose two alternative methods that aim to reduce the time required by 
the estimation procedure. In particular, the first proposed method starts computing 
the traditional Kolmogorov-Smirnov distances from a guess on the true value 
of the lower bound, and stops the procedure once a minimum is reached. The second 
proposed method uses the same starting guess and stopping rule to reduce the number 
of computations, and substitutes the cumulative distribution functions of the 
traditional Kolmogorov-Smirnov statistic with the corresponding probability mass 
functions. We apply the three methods to large collections of data (N = 500,000) 
with varying values of the true lower bound. Both the proposed methods yield 
significantly better performance and accuracy than the traditional method. 